The final project for the York Machine Learning course, Machine Learning in Business Context, was to apply everything we learned throughout the course to a single data set/project. We chose a dataset on housing prices in Beijing. This report follows the structure laid out in the CRISP-DM methodology.
The GitHub repository for all the source code is located at the following link: https://github.com/stellarclass/Beijing-Housing.
Because of memory-usage limits on shinyapps.io, we are not able to upload the app to that website. The app can, however, be viewed locally by running server.R in the same folder where this report resides.
Being able to predict future housing prices has many uses. One can use this information to plan when and where to buy a house, or to plan one's investments. Buying a house when prices are low and selling when they are high is one way to make money, after all.
Another aspect of this modeling task is to answer the question of which factors matter most for housing affordability in Beijing. One member of this group used to live in the city, so it will be interesting to compare a human's perception of housing prices with the results produced by an optimized model.
The data set was provided courtesy of Qichen Qiu who scraped the Beijing housing site, Lianjia.com.
The data contains various aspects of a house that prospective buyers would be interested in, some date-time attributes, and the location of the building.
The following table shows the house properties:
| Attribute | Description |
|---|---|
| followers | the number of people who follow the transaction (an indication of popularity) |
| totalPrice | the total price of the house |
| price | the average price per square metre |
| square | the square metres in the house |
| livingRoom | the number of bedrooms (translation error) |
| drawingRoom | the number of living rooms (translation error) |
| kitchen | the number of kitchens |
| bathRoom | the number of bathrooms |
| floor | the floor on which the house is located, includes where this floor is located in the building (e.g. top, bottom, etc.) |
| buildingType | the type of building (e.g. tower, bungalow, etc.) |
| constructionTime | the year the building was constructed |
| renovationCondition | the condition of the exterior |
| buildingStructure | what the building is made of |
| ladderRatio | the number of ladders (elevators) available per resident on the same floor |
| elevator | if there is an elevator or not |
| subway | if a subway station is nearby or not |
| communityAverage | the average price of a house in that community |
The following table shows the date-time attributes:
| Attribute | Description |
|---|---|
| tradeTime | the date of the transaction |
| DOM | the active days on market (indicates how quickly a house sold) |
| fiveYearsProperty | whether the owner has held the property for five years or more |
The following table shows the location attributes:
| Attribute | Description |
|---|---|
| Lng | the longitude of the house |
| Lat | the latitude of the house |
| Cid | the community ID |
| district | the district where the house is located |
There are also two columns that identify the house listing on the website:
| Attribute | Description |
|---|---|
| url | the url of the house/transaction from where data was scraped |
| id | the ID of the transaction |
An important consideration for this data set is that some important information is recorded in Chinese characters. This information would have been lost had we not had a team member whose computer could render the characters properly and who could understand Chinese.
There is some cleanup to do before we can even start visualization. First of all, we want to convert all Chinese characters in the dataset into English.
### Loading data
data=read.csv(paste(data_dir,"original_data.csv",sep="/"), header = TRUE, sep = ",",na.strings = c("nan","#NAME?"), fileEncoding="gbk")
data$floor = gsub("高", "high", data$floor)
data$floor = gsub("低", "low", data$floor)
data$floor = gsub("中", "mid", data$floor)
data$floor = gsub("底", "bottom", data$floor)
data$floor = gsub("顶", "top", data$floor)
data$floor = gsub("未知", "unkown", data$floor)
data$buildingStructure = gsub("钢混结构", "steel-concrete composite", data$buildingStructure)
data$buildingStructure = gsub("混合结构", "mixed", data$buildingStructure)
data$drawingRoom = gsub("中", "mid", data$drawingRoom)
data$drawingRoom = gsub("高", "high", data$drawingRoom)
data$drawingRoom = gsub("低", "low", data$drawingRoom)
data$drawingRoom = gsub("底", "bottom", data$drawingRoom)
data$drawingRoom = gsub("顶", "top", data$drawingRoom)
data$constructionTime = gsub("未知", NA, data$constructionTime)
data$bathRoom = gsub("未知", NA, data$bathRoom)
data$Cid = as.factor(data$Cid)
We also find, by inspecting some of the original URLs, that several values appear to have been shifted into the wrong columns, leaving off-the-chart values. For example, some rows have 2006 bathrooms; it seems more likely that 2006 is the construction time for those rows. Moreover, all these rows have missing values. We infer that these bad data points probably result from bugs in the crawler. Since there are only a handful of them, we drop them directly.
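As a toy illustration (not the real scrape) of the symptom described above, a year value shifted into bathRoom tends to coincide with a missing livingRoom, so both checks flag the same rows:

```r
# Toy data showing the column-shift symptom: a year landed in bathRoom,
# and the same row is missing livingRoom.
toy <- data.frame(bathRoom   = c(1, 2, 2006),
                  livingRoom = c(2, 3, NA))
bad <- which(toy$bathRoom > 10 | is.na(toy$livingRoom))
bad  # the third row shows both symptoms
```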
data = data[-which(is.na(data$livingRoom)),]
The floor type (top floor, mid floor, bottom floor) might be important for price prediction, so we split the floor column into a type and a level.
#split floor to have height info
temp <- as.data.frame(str_split_fixed(data$floor, " ", 2))
colnames(temp) <- c("floorType", "floorLevel")
data <- cbind(data, temp)
data$floorLevel = as.integer(as.character(data$floorLevel)) # as.character avoids converting factor level codes
data <- data[, !colnames(data) %in% c("floor")]
There appear to be some outliers in the ladder ratio:
boxplot(data$ladderRatio, las = 1, ylim=c(0,2))
It does not make sense to have more than one ladder per unit, so anything greater than a 1:1 ratio is considered an outlier, and we will now convert it to NA. The renovation condition also cannot be 0, so we change that to NA as well.
data$ladderRatio[data$ladderRatio > 1] <- NA
data$renovationCondition[data$renovationCondition == 0] <- NA
Now we have some NA values to deal with.
apply(data,2,function(x){sum(is.na(x))/length(x)*100})
## url id Lng
## 0.0000000 0.0000000 0.0000000
## Lat Cid tradeTime
## 0.0000000 0.0000000 0.0000000
## DOM followers totalPrice
## 49.5484899 0.0000000 0.0000000
## price square livingRoom
## 0.0000000 0.0000000 0.0000000
## drawingRoom kitchen bathRoom
## 0.0000000 0.0000000 0.0000000
## floor buildingType constructionTime
## 0.0000000 0.6339020 6.0482594
## renovationCondition buildingStructure ladderRatio
## 0.0000000 0.0000000 0.3230673
## elevator fiveYearsProperty subway
## 0.0000000 0.0000000 0.0000000
## district communityAverage
## 0.0000000 0.1452235
We will use mice imputation to fill these missing values. We run the imputation on the data, excluding some less useful columns so that it runs faster, and use the quickpred option to speed things up further. Since the imputation takes very long to run (>=10 hours), we performed it offline; the following code is included for documentation. Later, when we start modeling, we will load the imputed data for our analysis.
DO_IMPUTE = F
if(DO_IMPUTE) {
excluded <- c('Lng', 'Lat', 'Cid', 'followers', 'totalPrice', 'price')
mice_mod <- mice(data[, !names(data) %in% excluded],
                 pred = quickpred(data[, !names(data) %in% excluded],
                                  mincor = 0.1, minpuc = 0.3))
# Save the complete output
mice_output <- complete(mice_mod)
}
We can now plot the distributions to see how the imputation looks and to make sure nothing looks out of place:
#read in data because of length of time to impute
mice_output <- read.csv(paste(data_dir,"mice_output.csv",sep="/"), header = TRUE, sep = ",")
# Plot distributions
par(mfrow=c(1,2))
hist(data$DOM, freq=F, main='Original Data',
col='darkgreen', ylim=c(0,0.04))
hist(mice_output$DOM, freq=F, main='MICE Output',
col='lightgreen', ylim=c(0,0.04))
hist(as.numeric(data$buildingType), freq=F, main='Original Data',
col='darkgreen', ylim=c(0,0.04))
hist(as.numeric(mice_output$buildingType), freq=F, main='MICE Output',
col='lightgreen', ylim=c(0,0.04))
hist(data$ladderRatio, freq=F, main='Original Data',
col='darkgreen', ylim=c(0,0.04))
hist(mice_output$ladderRatio, freq=F, main='MICE Output',
col='lightgreen', ylim=c(0,0.04))
data$constructionTime = as.integer(data$constructionTime)
hist(data$constructionTime, freq=F, main='Original Data',
col='darkgreen', ylim=c(0,0.04))
hist(mice_output$constructionTime, freq=F, main='MICE Output',
col='lightgreen', ylim=c(0,0.04))
Everything looks fine, so we can now assign the imputed values to the data set:
data$DOM <- mice_output$DOM
data$buildingType <- mice_output$buildingType
data$ladderRatio <- mice_output$ladderRatio
data$communityAverage <- mice_output$communityAverage
data$constructionTime <- mice_output$constructionTime
We first load the imputed, cleaned data for our visualization tasks, then do some additional preprocessing for visualization.
dataVisualize=read.csv(paste(data_dir,"imputed_data.csv",sep="/"), header = TRUE, sep = ",")
# Adding 2 attributes Month and Year for Selling Time
dataVisualize$tradeYear <- as.integer(year(ymd(dataVisualize$tradeTime)))
dataVisualize$tradeMonth <- as.integer(month(ymd(dataVisualize$tradeTime)))
df3 <- data.frame(dataVisualize %>% na.omit())
df3$buildingType <- as.factor(df3$buildingType)
df3$buildingStructure <- as.factor(df3$buildingStructure)
df3$elevator <- as.factor(df3$elevator)
df3$fiveYearsProperty <- as.factor(df3$fiveYearsProperty)
df3$subway <- as.factor(df3$subway)
df3$district <- as.factor(df3$district)
df3$renovationCondition <- as.factor(df3$renovationCondition)
df3$tradeTimeTs <- as.Date(df3$tradeTime, format = "%Y-%m-%d")
df3$monthlyTradeTS <- as.Date(paste0(df3$tradeYear,'-',df3$tradeMonth,'-01'))
We can look at the Beijing housing market's total trading volume, trading value, and average total price over the years to get a general view of the market trend.
year_data <- subset(dataVisualize,!(dataVisualize$tradeYear < 2011))
all_group <- group_by(year_data, district, tradeYear)
beijing_by_volume <- dplyr::summarise(all_group, n=n())
ggplot(aes(x = tradeYear , y = n, fill = district), data = beijing_by_volume) +
geom_bar(stat = 'identity', position = 'stack', width = 0.5) +
coord_flip() +
xlab('Year') +
ylab('Trading Volume') +
theme_minimal() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4))
year_data <- subset(dataVisualize,!(dataVisualize$tradeYear < 2011))
all_group <- group_by(year_data, district, tradeYear)
beijing_by_value <- dplyr::summarise(all_group, value=sum(totalPrice)/1000)
ggplot(aes(x = tradeYear , y = value, fill = district), data = beijing_by_value) +
geom_bar(stat = 'identity', position = 'stack', width = 0.5) +
coord_flip() +
xlab('Year') +
ylab('Trading Value (1000 Yuan)') +
theme_minimal() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4))
all_group <- group_by(dataVisualize, tradeYear)
beijing_by_average_price <- dplyr::summarise(all_group, average=mean(totalPrice))
ggplot(aes(x=tradeYear, y=average), data =beijing_by_average_price) +
geom_line(size=1.5) +
xlab('Year') +
ylab('Average Total Price (10k Yuan)') +
theme_grey() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
Looking deeper into the comparison between districts by year, we can visualize how the market differs across districts over time in volume, value, and average price per square metre. Please use our application for more options on the date range; here, as an example, let's look at the best year of the market: 2016.
compareData <- subset(dataVisualize, dataVisualize$tradeYear == 2016)
all_group <- group_by(compareData, district, tradeYear)
beijing_by_volume <- dplyr::summarise(all_group, n=n())
ggplot(aes(x = district , y = n, fill = n), data = beijing_by_volume) +
geom_bar(stat = 'identity', width = 0.5) +
xlab('District') +
ylab('Trading Volume') +
theme_minimal() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4))
all_group <- group_by(compareData, district)
beijing_by_value <- dplyr::summarise(all_group, value=sum(totalPrice)/1000)
ggplot(aes(x = district , y = value, fill = value), data = beijing_by_value) +
geom_bar(stat = 'identity', width = 0.5) +
xlab('District') +
ylab('Trading Value (1000 Yuan)') +
theme_minimal() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4))
mypalette <- colorRampPalette(brewer.pal(12,'Paired'))(13)
mycols <- 3
mybox <- compareData %>% ggplot(aes_string('factor(district)','price')) +
  geom_boxplot(aes_string(color='factor(district)')) + # factor() avoids a continuous-x warning
  scale_color_manual(name='',values=mypalette) + theme_minimal(12) +
  theme(axis.title =element_blank(), legend.position='None') +
  coord_flip()
grid.arrange(mybox, ncol=1)
We can also look at some typical categorical attributes to see how the price per square metre depends on these features.
makeFeatureCatEDA <- function(x){
mypalette <-'Paired'
mycols <- 2
mybox <- df3 %>% ggplot(aes_string(x,'price')) + geom_boxplot(aes_string(color=x)) +
scale_color_brewer(name='', palette=mypalette) + theme_minimal(12) +
theme(axis.title =element_blank(), legend.position='None') +
labs(title='Average Price Per Square') #+ coord_flip()
grid.arrange(mybox, ncol=1)
}
makeFeatureCatEDA('buildingStructure')
makeFeatureCatEDA('buildingType')
makeFeatureCatEDA('renovationCondition')
makeFeatureCatEDA('elevator')
makeFeatureCatEDA('subway')
makeFeatureCatEDA('fiveYearsProperty')
### (2) Data correlations

Our imputed dataset has been cleaned, with all attributes transformed into numeric or binary values, so we will investigate the correlations between features on this dataset. For a better view, please visit our application.
corrplot(cor(
data %>% select_if(is.numeric) %>% select(-Lng, -Lat) ,use = "pairwise.complete.obs",
method='pearson')
,method='ellipse',
tl.cex = 1,
col = viridis::viridis(50),
tl.col='black')
Let’s re-read the data and do some type conversion for the coming modeling tasks.
data=read.csv(paste(data_dir,"imputed_data.csv",sep="/"), header = TRUE, sep = ",")
# Convert to useful type
data$tradeTime = ymd(data$tradeTime)
data$year = year(data$tradeTime)
data = data[, !colnames(data) %in% c("tradeTime")]
data$drawingRoom = as.integer(data$drawingRoom)
data$bathRoom = as.integer(data$bathRoom)
data$buildingType = as.factor(data$buildingType)
data$constructionTime = as.integer(data$constructionTime)
data$renovationCondition = as.factor(data$renovationCondition)
data$buildingStructure = as.factor(data$buildingStructure)
data$elevator = as.factor(data$elevator)
data$fiveYearsProperty = as.factor(data$fiveYearsProperty)
data$subway = as.factor(data$subway)
data$district=as.factor(data$district)
# non informative columns; or factors that have too many levels like Cid
data = data[,!colnames(data) %in% c("url","id","Cid", "totalPrice", "X")]
This section of the report is divided into two parts: first building clusters, then building predictive models for the house price.
We use clustering to explore the data a bit further. First, we create a data frame containing the attributes we are interested in, namely everything except the location data, because we want to see whether there is a relationship among the other attributes rather than where the houses are located. Then we run kproto to produce clusters.
Since clustering a dataset of this size takes a long time, we ran it offline; the code is shown for documentation purposes.
DO_CLUSTER = F
if(DO_CLUSTER) {
data_clust <- data[, !colnames(data) %in% c("Lng", "Lat", "Cid", "district")]
set.seed(1234)
kproto_clusters = list()
for (i in seq(2,8,1)) {
kproto_clusters = list.append(kproto_clusters,kproto(data_clust,i,lambda = 1))
}
wss = c()
for(kc in kproto_clusters) {
wss = c(wss,kc$tot.withinss)
}
}
wss <- readRDS(paste(data_dir,"wss.RDS",sep="/"))
plot(seq(2,8,1), wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
From the elbow in the plot, we can see that 3 or 4 clusters is a good choice.
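The elbow reasoning above can be sketched numerically on made-up WSS values (the real wss vector is loaded from wss.RDS below): the relative improvement from each additional cluster shrinks sharply once the elbow is passed.

```r
# Hypothetical within-group sum-of-squares for k = 2..8 (made-up numbers,
# not the project's wss vector) to illustrate the elbow heuristic.
wss_demo <- c(100, 60, 40, 35, 33, 32, 31)
# Relative drop in WSS gained by each extra cluster.
gain <- -diff(wss_demo) / head(wss_demo, -1)
round(gain, 2)  # improvements shrink sharply after the first few clusters
```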
# Since we have not run the model online, don't run this line!
if(F) kproto_selection = kproto_clusters[[3]]
Let’s map the clusters to see what it looks like:
kproto_selection <- readRDS(paste(data_dir,"kproto_sel.RDS",sep="/"))
data3 <- data
data3$cluster_id = as.factor(kproto_selection$cluster)
bj_map <- data.frame(data3$price, data3$Lat, data3$Lng, data3$cluster_id)
colnames(bj_map) <- c('price', 'lat', 'lon', 'cluster_id')
sbbox <- make_bbox(lon = data3$Lng, lat = data3$Lat, f = 0.05)
my_map <- get_map(location = sbbox, maptype = "roadmap", scale = 2, color="color", zoom = 10)
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=39.939895,116.40245&zoom=10&size=640x640&scale=2&maptype=roadmap&language=en-EN&key=<redacted>
ggmap(my_map) +
geom_point(data=bj_map, aes(x = lon, y = lat, color = cluster_id),
size = 0.5, alpha = 1) +
xlab('Longitude') +
ylab('Latitude') +
ggtitle('Clusters on Map')
There is a clear pattern to these clusters! Let’s take a closer look at the attributes in these clusters:
kproto_results <- data3 %>%
dplyr::select(-price,-Lng,-Lat,-district) %>%
group_by(cluster_id) %>%
do(the_summary = summary(.))
print(kproto_results$the_summary)
## [[1]]
## DOM followers square livingRoom
## Min. : 1.00 Min. : 0.00 Min. : 8.50 Min. :0.000
## 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 54.36 1st Qu.:1.000
## Median : 1.00 Median : 5.00 Median : 65.56 Median :2.000
## Mean : 15.39 Mean : 16.43 Mean : 77.88 Mean :1.966
## 3rd Qu.: 4.00 3rd Qu.: 18.00 3rd Qu.: 90.00 3rd Qu.:2.000
## Max. :1401.00 Max. :1015.00 Max. :640.00 Max. :9.000
## drawingRoom kitchen bathRoom
## Min. :0.000 Min. :0.0000 Min. :0.00
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:1.00
## Median :1.000 Median :1.0000 Median :1.00
## Mean :1.113 Mean :0.9971 Mean :1.17
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.00
## Max. :4.000 Max. :3.0000 Max. :7.00
## buildingType constructionTime renovationCondition
## bungalow : 34 Min. :1950 hardcover :23839
## plate :30324 1st Qu.:1990 other :22553
## plate-tower combination:12981 Median :1998 rough : 800
## tower :19520 Mean :1997 simplicity:15667
## 3rd Qu.:2004
## Max. :2016
## buildingStructure ladderRatio elevator
## brick and concrete : 3160 Min. :0.0200 no :23402
## brick and wood : 37 1st Qu.:0.2500 yes:39457
## mixed :19794 Median :0.3330
## steel : 29 Mean :0.3532
## steel-concrete composite:39826 3rd Qu.:0.5000
## unknown : 13 Max. :1.0000
## fiveYearsProperty subway communityAverage floorType
## no :19151 no :13466 Min. : 55551 bottom: 4948
## yes:43708 yes:49393 1st Qu.: 77867 high :13858
## Median : 85645 low :20222
## Mean : 87243 mid :23511
## 3rd Qu.: 94022 unkown: 320
## Max. :183109
## floorLevel year cluster_id
## Min. : 1.0 Min. :2011 1:62859
## 1st Qu.:10.0 1st Qu.:2014 2: 0
## Median :18.0 Median :2015 3: 0
## Mean :21.2 Mean :2015 4: 0
## 3rd Qu.:36.0 3rd Qu.:2016
## Max. :40.0 Max. :2018
##
## [[2]]
## DOM followers square livingRoom
## Min. : 1.00 Min. : 0.00 Min. : 6.90 Min. :0.000
## 1st Qu.: 1.00 1st Qu.: 3.00 1st Qu.: 50.85 1st Qu.:1.000
## Median : 12.00 Median : 15.00 Median : 61.30 Median :2.000
## Mean : 34.19 Mean : 28.16 Mean : 72.10 Mean :1.961
## 3rd Qu.: 48.00 3rd Qu.: 37.00 3rd Qu.: 80.46 3rd Qu.:2.000
## Max. :784.00 Max. :1143.00 Max. :1745.50 Max. :9.000
## drawingRoom kitchen bathRoom
## Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:1.000
## Median :1.000 Median :1.0000 Median :1.000
## Mean :1.053 Mean :0.9933 Mean :1.125
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :4.000 Max. :3.0000 Max. :7.000
## buildingType constructionTime renovationCondition
## bungalow : 176 Min. :1906 hardcover :11295
## plate :14224 1st Qu.:1987 other : 4178
## plate-tower combination: 4585 Median :1994 rough : 433
## tower : 5806 Mean :1994 simplicity: 8885
## 3rd Qu.:2003
## Max. :2016
## buildingStructure ladderRatio elevator
## brick and concrete : 1482 Min. :0.0140 no :11916
## brick and wood : 196 1st Qu.:0.2500 yes:12875
## mixed :11071 Median :0.3330
## steel : 19 Mean :0.3585
## steel-concrete composite:11990 3rd Qu.:0.5000
## unknown : 33 Max. :1.0000
## fiveYearsProperty subway communityAverage floorType
## no : 9002 no : 4910 Min. : 34439 bottom:2610
## yes:15789 yes:19881 1st Qu.: 90890 high :4860
## Median :103114 low :7428
## Mean :105759 mid :9816
## 3rd Qu.:121898 unkown: 77
## Max. :183109
## floorLevel year cluster_id
## Min. : 1.00 Min. :2011 1: 0
## 1st Qu.: 9.00 1st Qu.:2016 2:24791
## Median :23.00 Median :2016 3: 0
## Mean :22.75 Mean :2016 4: 0
## 3rd Qu.:36.00 3rd Qu.:2017
## Max. :40.00 Max. :2018
##
## [[3]]
## DOM followers square livingRoom
## Min. : 1.00 Min. : 0.00 Min. : 12.50 Min. :0.000
## 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 57.40 1st Qu.:1.000
## Median : 1.00 Median : 7.00 Median : 71.95 Median :2.000
## Mean : 16.83 Mean : 20.03 Mean : 81.22 Mean :1.958
## 3rd Qu.: 13.00 3rd Qu.: 23.00 3rd Qu.: 96.37 3rd Qu.:2.000
## Max. :1352.00 Max. :1000.00 Max. :510.00 Max. :9.000
## drawingRoom kitchen bathRoom
## Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:1.000
## Median :1.000 Median :1.0000 Median :1.000
## Mean :1.159 Mean :0.9976 Mean :1.169
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :5.000 Max. :4.0000 Max. :6.000
## buildingType constructionTime renovationCondition
## bungalow : 12 Min. :1953 hardcover :45501
## plate :52591 1st Qu.:1994 other :36911
## plate-tower combination:23960 Median :2001 rough : 1848
## tower :36113 Mean :1999 simplicity:28416
## 3rd Qu.:2006
## Max. :2016
## buildingStructure ladderRatio elevator
## brick and concrete : 4143 Min. :0.0140 no :37016
## brick and wood : 13 1st Qu.:0.2500 yes:75660
## mixed :29529 Median :0.3330
## steel : 62 Mean :0.3637
## steel-concrete composite:78908 3rd Qu.:0.5000
## unknown : 21 Max. :1.0000
## fiveYearsProperty subway communityAverage floorType
## no :38676 no :39214 Min. : 30673 bottom: 7913
## yes:74000 yes:73462 1st Qu.: 56126 high :25974
## Median : 61195 low :35651
## Mean : 61456 mid :42802
## 3rd Qu.: 67440 unkown: 336
## Max. :100448
## floorLevel year cluster_id
## Min. : 1.00 Min. :2002 1: 0
## 1st Qu.:10.00 1st Qu.:2014 2: 0
## Median :18.00 Median :2015 3:112676
## Mean :21.34 Mean :2015 4: 0
## 3rd Qu.:36.00 3rd Qu.:2016
## Max. :40.00 Max. :2018
##
## [[4]]
## DOM followers square livingRoom
## Min. : 1.000 Min. : 0.00 Min. : 10.00 Min. :0.000
## 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.: 63.80 1st Qu.:2.000
## Median : 1.000 Median : 3.00 Median : 86.14 Median :2.000
## Mean : 9.534 Mean : 11.37 Mean : 90.34 Mean :2.094
## 3rd Qu.: 1.000 3rd Qu.: 11.00 3rd Qu.:106.53 3rd Qu.:3.000
## Max. :1677.000 Max. :1085.00 Max. :922.70 Max. :9.000
## drawingRoom kitchen bathRoom
## Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:1.000
## Median :1.000 Median :1.0000 Median :1.000
## Mean :1.241 Mean :0.9901 Mean :1.229
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :5.000 Max. :3.0000 Max. :7.000
## buildingType constructionTime renovationCondition
## bungalow : 6 Min. :1933 hardcover :36803
## plate :76772 1st Qu.:1998 other :55098
## plate-tower combination:18380 Median :2003 rough : 2309
## tower :23335 Mean :2002 simplicity:24283
## 3rd Qu.:2006
## Max. :2016
## buildingStructure ladderRatio elevator
## brick and concrete : 5556 Min. :0.0140 no :62509
## brick and wood : 7 1st Qu.:0.3000 yes:55984
## mixed :55385 Median :0.5000
## steel : 77 Mean :0.4049
## steel-concrete composite:57343 3rd Qu.:0.5000
## unknown : 125 Max. :1.0000
## fiveYearsProperty subway communityAverage floorType
## no :46160 no :69583 Min. :10847 bottom:10609
## yes:72333 yes:48910 1st Qu.:38959 high :25402
## Median :44385 low :38241
## Mean :44542 mid :43703
## 3rd Qu.:49524 unkown: 538
## Max. :87923
## floorLevel year cluster_id
## Min. : 1.00 Min. :2002 1: 0
## 1st Qu.:13.00 1st Qu.:2013 2: 0
## Median :34.00 Median :2015 3: 0
## Mean :25.09 Mean :2014 4:118493
## 3rd Qu.:36.00 3rd Qu.:2016
## Max. :40.00 Max. :2018
Cluster 1 units are not on the market for very long (15 days on average, but a 1-day median!), meaning they are in high demand. The price is around 350-400, but the units tend to be pretty small at 60-70 m²; they are not bungalows and tend to be in the middle of a building. Typically they were built in the 90s and have been owned for over 5 years. While they may or may not have an elevator, they are near a subway.
Cluster 2 units stay on the market for a while: 34 days on average, 12 days median. This is likely because they are very expensive, at about 600-700! They are also quite cramped at 60-70 m² but tend to be 2-bedroom units. The floor distribution is relatively even, except that few are on the bottom. The buildings are generally plate type and built in the 90s. They are also near the subway but may or may not have an elevator, and they tend to have been owned for over 5 years.
Cluster 3 units do not stay on the market for too long, though slightly longer than cluster 1: a 1-day median and 17-day mean. They are pretty cheap at around 300-350 and slightly bigger than cluster 1, at 70-80 m² and generally 2 bedrooms. They are not bungalows and tend to be slightly newer, built in the 2000s, which means they more often have elevators and are near the subway. They are usually owned for over 5 years. The floor distribution is more even, though units tend not to be on the ground floor.
Cluster 4 sells very quickly: a 1-day median and 10-day mean! These units are also the cheapest, at about 200-250. They are slightly bigger than cluster 3 at 80-90 m² with roughly the same number of rooms, and are definitely not bungalows. They were also built in the 2000s but are slightly newer than cluster 3. They may or may not have elevators, but they tend not to be near a subway. Slightly more than half were owned for over 5 years, and the floor distribution is even, with slightly fewer on the bottom.
Based on the locations on the map, cluster 4 is the farthest from the centre. This suggests more suburban living, which may account for why it is the cheapest yet the biggest. Cluster 3 is also outside the city centre, around the outskirts of downtown, while clusters 1 and 2 are the downtown properties. Cluster 2 takes the longest to sell and is the most expensive because it is prime real estate, but the units are small; it is probably hard to justify buying cluster 2 type units.
For the predictive modeling of this dataset, we will build a model for "price", the total price divided by the square metres of the property. We think this is a better indicator of general affordability in Beijing than "totalPrice". We divided the dataset into training and test sets by year: all data before 2017 form the training set and all data from 2017 onward form the test set. This mimics the situation where people use historical data to predict future housing prices in Beijing; after all, there is no point in predicting a past transaction, because we can simply look it up. Credit: [this report](https://www.kaggle.com/jonathanbouchet/forecasting-beijing-s-housing).
We prepared two types of data for training and testing. The first keeps the factors without explicit dummy coding; this is for our own convenience when writing formulas for the glm models.
data_train_nodummy <- data.frame(data %>% filter(year<2017))
data_test_nodummy <- data.frame(data %>% filter(year>=2017))
The second type of data explicitly dummy-encodes the factor variables. We need this because we will be using the advanced modeling package xgboost, which can only handle numeric data. Finally, we define 5-fold cross-validation as the evaluation method for our modeling tasks.
for(col in names(data)) {
if(!is.factor(data[,col])) {
data[,col] = as.numeric(data[,col])
next
}
if(col=="price") next
f = as.formula(paste("price~",col,"-1",sep=""))
m = model.matrix(f,data)
data = data[,!colnames(data) %in% c(col)]
data = cbind(data,m)
}
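As a toy illustration (not the project data) of what the loop above does, `model.matrix` with a `-1` in the formula expands a factor into one indicator column per level:

```r
# Toy data: a single factor with two levels.
toy <- data.frame(price  = c(10, 20, 30),
                  colour = factor(c("red", "blue", "red")))
# "-1" removes the intercept, so every level gets its own indicator column.
m <- model.matrix(price ~ colour - 1, toy)
colnames(m)       # one column per level of the factor
m[, "colourred"]  # 1 for the rows whose colour is "red"
```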
data_train <- data.frame(data %>% filter(year<2017))
data_test <- data.frame(data %>% filter(year>=2017))
train_x = as.matrix(data_train[,!colnames(data_train) %in% c("price")])
train_y = data_train[,colnames(data_train) %in% c("price")]
test_x = as.matrix(data_test[,!colnames(data_test) %in% c("price")])
test_y = data_test[,colnames(data_test) %in% c("price")]
folds <- createFolds(train_y, k = 5)
train_control <- trainControl(
method = "cv",
index = folds,
verboseIter = F,
allowParallel = TRUE # FALSE for reproducible results
)
We selected linear regression as our baseline model. A small trick applied here is manual feature selection according to the p-values produced by the linear regression:

1. Run a glm model on the dataset.
2. Exclude the features the glm model reports as having high p-values.
3. Rerun the glm model using the features remaining after step 2.

Steps 1 and 2 are not shown here; only the result of step 3 is. According to the p-value criterion, livingRoom and elevator do not have a significant effect on the price.
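Steps 1 and 2 can be sketched on toy data (the actual run on the housing data is not shown in this report): fit a glm, then list the predictors whose coefficient p-values exceed 0.05 as candidates to drop.

```r
# Toy data: x1 carries all the signal, x2 is pure noise.
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 2 * toy$x1 + rnorm(200, sd = 0.1)
fit <- glm(y ~ x1 + x2, data = toy)
pvals <- summary(fit)$coefficients[, 4]  # column 4 holds the p-values
# Predictors with high p-values are candidates for exclusion in step 2.
names(pvals)[pvals > 0.05 & names(pvals) != "(Intercept)"]
```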
model_glm <- caret::train(
price ~.-livingRoom-elevator, data_train_nodummy,
metric = "Rsquared",
trControl = train_control,
method = "glm",
preProcess = c("zv", "nzv","center","scale")
)
model_glm$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 8983.025 0.7701259 6495.752 16.9356 0.0005795672 9.918903
We see that the linear regression model is not particularly good: its R-squared only reaches 0.77. The glmnet model, which adds regularization, is no better than the plain glm model, so we do not show its results here. This suggests an underfitting problem resulting from a model that is too simple. We have also only experimented with a first-order linear model, but before attempting a higher-order one there is another problem: which predictors do we combine to form the higher-order terms? Let's make a guess for the formula:
model_glm_highorder <- caret::train(
price ~renovationCondition*elevator*communityAverage*subway*floorType,
data_train_nodummy,
metric = "Rsquared",
trControl = train_control,
method = "glm",
preProcess = c("zv", "nzv","center","scale")
)
model_glm_highorder$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 11332.56 0.6341442 8381.932 11.71689 0.0005282009 4.61192
As it turns out, the 5th-order linear model is worse. If we went down this path, the big hyperparameters to tune would be the interaction order and the combination of predictors forming the higher-order terms, and the only way to tune them seems to be an exhaustive search. Instead, we will look at two more complex models to see whether smarter algorithms can give us better results.
Previously we mentioned that linear regression has an underfitting problem. We now attempt to fit the dataset with a decision tree model, using cp as the hyperparameter to tune.
model_rpart <- caret::train(
x = train_x,
y = train_y,
metric = "Rsquared",
tuneGrid = expand.grid(
cp = c(0:5/50)
),
trControl = train_control,
method = "rpart",
preProcess = c("zv", "nzv","center","scale")
)
plot(model_rpart)
A small complexity parameter yields larger trees and potential overfitting; a large complexity parameter yields smaller trees and potential underfitting. As the plot of RMSE versus complexity parameter shows, cp = 0 still achieves the lowest RMSE, which suggests the model still underfits. We continue looking for a more complex model.
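The cp trade-off can also be seen directly in rpart: a large tree grown at cp = 0 can be pruned back at any higher cp, which is how the parameter controls tree size. A minimal self-contained sketch on the built-in mtcars data (a stand-in for our training set):

```r
library(rpart)

# Grow a deliberately large tree by disabling the complexity penalty
big <- rpart(mpg ~ ., data = mtcars,
             control = rpart.control(cp = 0, minsplit = 2))

# Prune it back at a higher complexity parameter
small <- prune(big, cp = 0.05)

# Pruning can only remove splits, never add them
nrow(small$frame) <= nrow(big$frame)
```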
Next, we investigate a powerful but complex model, xgboost. As we will see shortly, out-of-the-box xgboost on this dataset yields a ~15% reduction in RMSE compared to the decision tree model; with some tuning, we achieve an 18% reduction compared to the rpart model. We followed this tutorial in learning how to tune the xgboost model.
The tuning process takes a long time to run. We ran these steps once and saved all intermediate models to the file system; the following code documents the tuning process.
TUNE_XGB = F
if(TUNE_XGB) {
grid_default <- expand.grid(
nrounds = 100,
max_depth = 6,
eta = 0.3,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
xgb_base <- caret::train(
x = train_x,
y = train_y,
metric = "Rsquared",
trControl = train_control,
tuneGrid = grid_default,
method = "xgbTree",
verbose = TRUE,
preProcess = c("zv", "nzv","center","scale")
)
nrounds <- 500
# note to start nrounds from 200, as smaller learning rates result in errors so
# big with lower starting points that they'll mess the scales
tune_grid1 <- expand.grid(
nrounds = seq(from = 100, to = nrounds, by = 100),
eta = c(0.05, 0.1, 0.3),
max_depth = c(2, 4, 6, 8),
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
xgb_tune1 <- caret::train(
x = train_x,
y = train_y,
metric = "Rsquared",
trControl = train_control,
tuneGrid = tune_grid1,
method = "xgbTree",
verbose = TRUE,
preProcess = c("zv", "nzv","center","scale")
)
tune_grid2 <- expand.grid(
nrounds = seq(from = 100, to = nrounds, by = 100),
eta = xgb_tune1$bestTune$eta,
max_depth = xgb_tune1$bestTune$max_depth,
gamma = 0,
colsample_bytree = 1,
min_child_weight = c(1, 2, 3),
subsample = 1
)
xgb_tune2 <- caret::train(
x = train_x,
y = train_y,
metric = "Rsquared",
trControl = train_control,
tuneGrid = tune_grid2,
method = "xgbTree",
verbose = TRUE,
preProcess = c("zv", "nzv","center","scale")
)
tune_grid3 <- expand.grid(
nrounds = seq(from = 100, to = nrounds, by = 100),
eta = xgb_tune1$bestTune$eta,
max_depth = xgb_tune1$bestTune$max_depth,
gamma = 0,
colsample_bytree = c(0.33, 0.66, 1.0),
min_child_weight = xgb_tune2$bestTune$min_child_weight,
subsample = c(0.5, 0.75, 1.0)
)
xgb_tune3 <- caret::train(
x = train_x,
y = train_y,
metric = "Rsquared",
trControl = train_control,
tuneGrid = tune_grid3,
method = "xgbTree",
verbose = TRUE,
preProcess = c("zv", "nzv","center","scale")
)
tune_grid4 <- expand.grid(
nrounds = seq(from = 100, to = nrounds, by = 100),
eta = xgb_tune1$bestTune$eta,
max_depth = xgb_tune1$bestTune$max_depth,
gamma = c(0, 0.1, 0.5, 0.8, 1.0),
colsample_bytree = xgb_tune3$bestTune$colsample_bytree,
min_child_weight = xgb_tune2$bestTune$min_child_weight,
subsample = xgb_tune3$bestTune$subsample
)
xgb_tune4 <- caret::train(
x = train_x,
y = train_y,
metric = "Rsquared",
trControl = train_control,
tuneGrid = tune_grid4,
method = "xgbTree",
verbose = TRUE,
preProcess = c("zv", "nzv","center","scale")
)
tune_grid5 <- expand.grid(
nrounds = seq(from = 100, to = nrounds, by = 100),
eta = seq(xgb_tune1$bestTune$eta,xgb_tune1$bestTune$eta/20,length.out = 5),
max_depth = xgb_tune1$bestTune$max_depth,
gamma = xgb_tune4$bestTune$gamma,
colsample_bytree = xgb_tune3$bestTune$colsample_bytree,
min_child_weight = xgb_tune2$bestTune$min_child_weight,
subsample = xgb_tune3$bestTune$subsample
)
xgb_tune5 <- caret::train(
x = train_x,
y = train_y,
metric = "Rsquared",
trControl = train_control,
tuneGrid = tune_grid5,
method = "xgbTree",
verbose = TRUE,
preProcess = c("zv", "nzv","center","scale")
)
bestTune <- xgb_tune5$bestTune
write.csv(bestTune,paste(model_dir,"bestXGB.csv",sep = "/"),row.names = F)
# Retrain the best models
grid_best <- expand.grid(
nrounds = bestTune$nrounds,
max_depth = bestTune$max_depth,
eta = bestTune$eta,
gamma = bestTune$gamma,
colsample_bytree = bestTune$colsample_bytree,
min_child_weight = bestTune$min_child_weight,
subsample = bestTune$subsample
)
xgb_best <- caret::train(
x = train_x,
y = train_y,
metric = "Rsquared",
trControl = train_control,
tuneGrid = grid_best,
method = "xgbTree",
verbose = TRUE,
preProcess = c("zv", "nzv","center","scale")
)
saveRDS(xgb_best, paste(model_dir,"xgb.model",sep="/"))
saveRDS(xgb_base, paste(model_dir,"xgb_base.model",sep="/"))
saveRDS(xgb_tune1, paste(model_dir,"xgb_tune1.model",sep="/"))
saveRDS(xgb_tune2, paste(model_dir,"xgb_tune2.model",sep="/"))
saveRDS(xgb_tune3, paste(model_dir,"xgb_tune3.model",sep="/"))
saveRDS(xgb_tune4, paste(model_dir,"xgb_tune4.model",sep="/"))
saveRDS(xgb_tune5, paste(model_dir,"xgb_tune5.model",sep="/"))
}
xgb_best = readRDS(paste(model_dir,"xgb.model",sep="/"))
xgb_base = readRDS(paste(model_dir,"xgb_base.model",sep="/"))
xgb_tune1 = readRDS(paste(model_dir,"xgb_tune1.model",sep="/"))
xgb_tune2 = readRDS(paste(model_dir,"xgb_tune2.model",sep="/"))
xgb_tune3 = readRDS(paste(model_dir,"xgb_tune3.model",sep="/"))
xgb_tune4 = readRDS(paste(model_dir,"xgb_tune4.model",sep="/"))
xgb_tune5 = readRDS(paste(model_dir,"xgb_tune5.model",sep="/"))
model_list = list(v1=xgb_tune1,
v2=xgb_tune2,
v3=xgb_tune3,
v4=xgb_tune4,
v5=xgb_tune5,
xgboost_base=xgb_base
)
resamps <- resamples(model_list)
summary(resamps, metric = "Rsquared")
##
## Call:
## summary.resamples(object = resamps, metric = "Rsquared")
##
## Models: v1, v2, v3, v4, v5, xgboost_base
## Number of resamples: 5
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## v1 0.8859283 0.8860939 0.8875850 0.8869646 0.8875976 0.8876180
## v2 0.8866643 0.8867971 0.8879025 0.8874734 0.8879401 0.8880630
## v3 0.8865105 0.8871361 0.8880237 0.8876447 0.8882601 0.8882931
## v4 0.8866755 0.8867735 0.8881503 0.8877019 0.8882155 0.8886946
## v5 0.8870465 0.8875415 0.8885789 0.8881978 0.8888880 0.8889341
## xgboost_base 0.8788157 0.8794572 0.8811579 0.8804113 0.8811662 0.8814595
## NA's
## v1 0
## v2 0
## v3 0
## v4 0
## v5 0
## xgboost_base 0
dotplot(resamps, metric = "Rsquared")
print(xgb_tune5$bestTune)
## nrounds max_depth eta gamma colsample_bytree min_child_weight
## 20 500 8 0.07625 1 1 3
## subsample
## 20 0.75
bestTune = xgb_tune5$bestTune
We have obtained the best hyperparameters for xgboost according to 5-fold cross-validation. A final cross-validated comparison of the R-squared of the four models follows.
model_list = list(glm=model_glm,
rpart=model_rpart,
xgboost_base=xgb_base,
xgboost_best=xgb_best
)
resamps <- resamples(model_list)
summary(resamps, metric = "Rsquared")
##
## Call:
## summary.resamples(object = resamps, metric = "Rsquared")
##
## Models: glm, rpart, xgboost_base, xgboost_best
## Number of resamples: 5
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## glm 0.7695310 0.7697715 0.7698761 0.7701259 0.7705217 0.7709291
## rpart 0.8343046 0.8346839 0.8352170 0.8351240 0.8356974 0.8357174
## xgboost_base 0.8788157 0.8794572 0.8811579 0.8804113 0.8811662 0.8814595
## xgboost_best 0.8869070 0.8872378 0.8884692 0.8881374 0.8889711 0.8891017
## NA's
## glm 0
## rpart 0
## xgboost_base 0
## xgboost_best 0
dotplot(resamps, metric = "Rsquared")
We now apply the previously trained models to a “future dataset”: the transaction data from 2017 onward. Recall that we trained the models on data from before 2017 and are trying to predict housing prices from 2017 on. Below, the true price is plotted against the predicted price; the blue line is the ideal prediction, where every predicted price matches the true price exactly.
model = xgb_best
pred = data.frame('Prediction'= predict(model,test_x),'True' = test_y)
ggplot(data=pred,aes(x=True,y=Prediction)) + geom_jitter() + geom_smooth(method='lm',size=.5) +
theme_minimal(12) + labs(title=paste0('Prediction vs. Ground truth for ',model$method))
xgb_best_test_results = postResample(pred = pred$Prediction, obs = pred$True)
model = model_rpart
pred = data.frame('Prediction'= predict(model,test_x),'True' = test_y)
ggplot(data=pred,aes(x=True,y=Prediction)) + geom_jitter() + geom_smooth(method='lm',size=.5) +
theme_minimal(12) + labs(title=paste0('Prediction vs. Ground truth for ',model$method))
rpart_test_results = postResample(pred = pred$Prediction, obs = pred$True)
model = model_glm
pred = data.frame('Prediction'= predict(model_glm,data_test_nodummy),'True' = data_test_nodummy$price)
ggplot(data=pred,aes(x=True,y=Prediction)) + geom_jitter() + geom_smooth(method='lm',size=.5) +
theme_minimal(12) + labs(title=paste0('Prediction vs. Ground truth for ',model_glm$method))
glm_test_results = postResample(pred = pred$Prediction, obs = pred$True)
df_test_results = data.frame(xgb_best_test_results,rpart_test_results,glm_test_results)
df_test_results
## xgb_best_test_results rpart_test_results glm_test_results
## RMSE 1.861086e+04 1.992175e+04 1.795427e+04
## Rsquared 8.849469e-01 8.126012e-01 8.083139e-01
## MAE 1.650119e+04 1.715371e+04 1.367765e+04
Feature importance from the xgboost model:
mat = xgb.importance (feature_names = colnames(train_x),model = xgb_best$finalModel)
xgb.plot.importance (importance_matrix = mat[1:10])
The model concludes that construction time is the most significant factor in determining the price per square metre. This is only partly intuitive: while it is certainly reasonable that construction time is an important factor, it was not obvious to us that it would be the most significant one. Meanwhile, factors such as the number of bathrooms and the building type do not even make the top 10, which is counter-intuitive!
An R Shiny app was created for the end user; it is located at this link. This interactive app allows users to visualize the Beijing housing market. Users can examine trends across neighbourhoods to decide where a good place to buy a house might be, and view forecasted housing prices to judge whether it is a good time to purchase.
To keep this useful, regular maintenance should be expected, including refreshing the dataset and retraining the model as the years go by; an annual update would provide predictions for the following year. Another improvement could be adding more features to the model: some features were not captured by the web scraper, and they could be added in the future to see whether they improve prediction quality.
From our work on this dataset, we can see how supervised and unsupervised learning methods can each be applied to different ends. Here, clustering was performed mainly for analysis, to better understand the Beijing housing market, while the supervised methods were used to forecast future housing prices and to generate a machine perspective on the most important factors in determining the price of a property.